Convergence Analysis of Policy Iteration

Ali Heydari


Abstract 

Adaptive optimal control of nonlinear dynamic systems with deterministic and known dynamics under a known undiscounted
infinite-horizon cost function is investigated. The policy iteration scheme, initiated using a stabilizing initial control, is analyzed for
solving the problem. The convergence of the iterations and the optimality of the limit functions, which follows from the established
uniqueness of the solution to the Bellman equation, are the main results of this study. Furthermore, a theoretical comparison of
the speed of convergence of policy iteration with that of value iteration is presented. Finally, the convergence results are extended to
the case of multi-step look-ahead policy iteration.


I. Introduction 

This short study investigates the convergence of policy iteration (PI), one of the schemes for implementing
adaptive/approximate dynamic programming (ADP), sometimes referred to as reinforcement learning (RL) or neuro-dynamic
programming (NDP), [1]-[11].

Compared to its alternative, i.e., value iteration (VI), the PI calls for a higher computational load per iteration, due to a ‘full
backup’ as opposed to a ‘partial backup’ in VI, [12]. However, the PI has the advantage that the control under evolution remains
stabilizing [?], hence, it is more suitable for online implementation, i.e., adapting the control ‘on the fly’. The convergence
analyses for PI with continuous state and control spaces and an undiscounted cost function are given in [10]. The results
presented in this study, however, are from a different viewpoint with different assumptions and lines of proof. Moreover,
interested readers are referred to the results from simultaneous research (at least in terms of the availability of the results to
the public) presented in [11], which are the closest to the first two theorems of this study.

This study establishes the convergence of the PI to the solution to the optimal control problem with known deterministic
dynamics. Moreover, given the faster convergence of PI compared with VI which can be observed in numerical implementations,
some theoretical results are presented which compare the rates of convergence. Finally, the multi-step look-ahead variation of
PI, [?], is analyzed and its convergence is established.

II. Problem Formulation 

The discrete-time nonlinear system given by

$$x_{k+1} = f(x_k, u_k), \quad k \in \mathbb{N}, \tag{1}$$

is subject to control, where the (possibly discontinuous) function $f : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n$ is known, the state and control vectors are
denoted with $x$ and $u$, respectively, and $f(0,0) = 0$. Positive integers $n$ and $m$ denote the dimensions of the continuous state
space $\mathbb{R}^n$ and the (possibly discontinuous) control space $\mathcal{U} \subset \mathbb{R}^m$, respectively, sub-index $k$ represents the discrete time index,
and the set of non-negative integers is denoted with $\mathbb{N}$. The cost function subject to minimization is given by

$$J = \sum_{k=0}^{\infty} U(x_k, u_k), \tag{2}$$

where the utility function $U : \mathbb{R}^n \times \mathcal{U} \to \mathbb{R}_+$ is positive semi-definite with respect to its first input and positive definite with
respect to its second input. The set of non-negative real numbers is denoted with $\mathbb{R}_+$.

Selecting an initial feedback control policy $h : \mathbb{R}^n \to \mathcal{U}$, i.e., $u_k = h(x_k)$, the adaptive optimal control problem is
updating/adapting the control policy such that cost function (2) is minimized. The minimizing control policy is called the
optimal control policy and denoted with $h^*(.)$.

Notation 1. The state trajectory initiated from the initial state $x_0$ and propagated using the control policy $h(.)$ is denoted
with $x_k^h, \forall k \in \mathbb{N}$. In other words, $x_0^h := x_0$ and $x_{k+1}^h = f\big(x_k^h, h(x_k^h)\big), \forall k \in \mathbb{N}$.

Definition 1. The control policy $h(.)$ is defined to be asymptotically stabilizing within a domain if $\lim_{k \to \infty} x_k^h = 0$, for every
initial $x_0^h$ selected within the domain, [?].

Definition 2. The set of admissible control policies (within a compact set), denoted with $\mathcal{H}$, is defined as the set of policies
$h(.)$ that asymptotically stabilize the system within the set and whose respective ‘cost-to-go’ or ‘value function’, denoted with
$V_h : \mathbb{R}^n \to \mathbb{R}_+$ and defined by

$$V_h(x_0) := \sum_{k=0}^{\infty} U\big(x_k^h, h(x_k^h)\big), \tag{3}$$

is upper bounded within the compact set by a continuous function $\bar{V} : \mathbb{R}^n \to \mathbb{R}_+$ where $\bar{V}(0) = 0$.

If the value function itself is continuous, the upper boundedness by $\bar{V}(.)$ is trivially met, through selecting $\bar{V}(.) = V_h(.)$.
Note that the continuity of the upper bound in the compact set leads to its finiteness within the set, and hence, the finiteness 
of the value function. This is a critical feature for the value function and hence, the control policy. 
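
As a minimal numerical sketch of the value function in (3), the infinite sum can be approximated by a truncated rollout. The scalar system $x_{k+1} = a x_k + b u_k$, the utility $U(x,u) = q x^2 + r u^2$, the linear policy $h(x) = -Kx$, and all numerical values below are assumptions chosen for illustration only; they do not come from this study. For this particular choice the value function has the closed form $V_h(x) = (q + rK^2)x^2 / (1 - (a - bK)^2)$, which gives something to compare the rollout against.

```python
def rollout_value(f, U, h, x0, steps=2000):
    """Approximate V_h(x0) in (3) by truncating the infinite sum after `steps` terms."""
    x, total = x0, 0.0
    for _ in range(steps):
        u = h(x)
        total += U(x, u)
        x = f(x, u)
    return total


# Hypothetical example data (assumptions, not taken from the paper).
a, b, q, r, K = 1.2, 1.0, 1.0, 1.0, 0.8


def f(x, u):
    return a * x + b * u          # dynamics with f(0, 0) = 0


def U(x, u):
    return q * x**2 + r * u**2    # utility, positive definite in u


def h(x):
    return -K * x                 # stabilizing policy since |a - b*K| < 1


def closed_form_value(x0):
    """Closed-form V_h(x0) for the linear-quadratic example above."""
    c = a - b * K                 # closed-loop multiplier
    return (q + r * K**2) * x0**2 / (1.0 - c**2)
```

The finiteness of the sum here relies on the policy being stabilizing; with a gain for which $|a - bK| \geq 1$ the truncated sum keeps growing with `steps`, which is the situation that the admissibility requirement rules out.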

Assumption 1. There exists at least one admissible control policy for the given system within a connected and compact set
$\Omega \subset \mathbb{R}^n$ containing the origin.

Assumption 2. The intersection of the set of $n$-vectors $x$ at which $U(x,0) = 0$ with the invariant set of $f(.,0)$ only contains
the origin.

Assumption 1 leads to the conclusion that the value function associated with the optimal control policy is finite at any point
in $\Omega$, as it will not be greater than $V_h(.)$ at that point, for any admissible control policy $h(.)$. Assumption 2 implies that the
optimal control policy will be asymptotically stabilizing, as there is no non-zero state trajectory that can ‘hide’ somewhere
without convergence to the origin. Given these two assumptions, it is concluded that the optimal control policy is an admissible
policy, i.e., $h^*(.) \in \mathcal{H}$.


III. ADP-based Solutions 
The Bellman equation [?], given below, provides the optimal value function

$$V^*(x) = \min_{u \in \mathcal{U}} \Big( U(x,u) + V^*\big(f(x,u)\big) \Big), \tag{4}$$

which once obtained, leads to the solution to the problem, through

$$h^*(x) = \arg\min_{u \in \mathcal{U}} \Big( U(x,u) + V^*\big(f(x,u)\big) \Big). \tag{5}$$

However, solving (4) directly is mathematically impracticable for general nonlinear systems, [?]. Policy iteration (PI) provides a learning
algorithm for training a function approximator or forming a lookup table, for approximating the solution [?], [?], [?]. This
approximation is done within a compact and connected set, containing the origin, called the domain of interest and denoted
with $\Omega$.

Starting with an initial admissible control policy, denoted with $h^0(.)$, one iterates through the policy evaluation equation
given by

$$V^i(x) = U\big(x, h^i(x)\big) + V^i\big(f(x, h^i(x))\big), \quad \forall x \in \Omega, \tag{6}$$

and the policy update equation given by

$$h^{i+1}(x) = \arg\min_{u \in \mathcal{U}} \Big( U(x,u) + V^i\big(f(x,u)\big) \Big), \quad \forall x \in \Omega, \tag{7}$$

for $i = 0, 1, \ldots$ until they converge, in PI. Each of these equations may be evaluated at different points in $\Omega$, for obtaining the
targets for training the respective function approximators.
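
The two PI equations can be sketched on a small lookup-table example. Everything specific below is an assumption made for illustration and is not from this study: the domain is the finite set of states 0 to 6 with state 0 playing the role of the origin, the controls are $\{-2, -1, 0\}$, the dynamics are $f(x,u) = \max(x+u, 0)$, and the utility is $U(x,u) = x + u^2$. Policy evaluation (6) is done by rolling each state forward until the origin is reached, which is valid only for an admissible policy on such a finite deterministic system.

```python
STATES = range(7)          # hypothetical finite domain; state 0 is the origin
CONTROLS = (-2, -1, 0)     # hypothetical control set; u = 0 is the zero control


def f(x, u):
    return max(x + u, 0)   # toy dynamics with f(0, 0) = 0


def U(x, u):
    return x + u * u       # toy utility, zero only at x = 0, u = 0


def evaluate(h):
    """Policy evaluation (6): accumulate the utility along the trajectory of h."""
    if h[0] != 0:
        raise ValueError("a nonzero control at the origin gives an unbounded cost")
    V = {}
    for x0 in STATES:
        x, total, steps = x0, 0, 0
        while x != 0:
            if steps > len(STATES):      # a cycle: h is not stabilizing
                raise ValueError("policy is not admissible")
            total += U(x, h[x])
            x = f(x, h[x])
            steps += 1
        V[x0] = total
    return V


def improve(V):
    """Policy update (7): greedy control with respect to the current value table."""
    return {x: min(CONTROLS, key=lambda u: U(x, u) + V[f(x, u)]) for x in STATES}


def policy_iteration(h0, max_iter=50):
    h, history = dict(h0), []
    for _ in range(max_iter):
        V = evaluate(h)
        history.append(V)
        h_next = improve(V)
        if h_next == h:                  # policy no longer changes
            break
        h = h_next
    return h, history


# Admissible initial policy: step towards the origin by one at every nonzero state.
h0 = {x: (-1 if x > 0 else 0) for x in STATES}
h_final, history = policy_iteration(h0)
```

In this toy case the loop stops after two policy evaluations, and the last value table satisfies the Bellman equation (4) at every state. For continuous state spaces the tables would be replaced by function approximators trained on targets generated from (6) and (7), as described above.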


IV. Convergence Analysis of Policy Iteration

Given the fact that Eqs. (6) and (7) are iterative equations, the following questions arise. 1- Do the iterations converge? 2-
If they converge, are the limit functions optimal? This section is aimed at answering these two questions. Initially the following
two lemmas are presented.


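Lemma 1. Given admissible control policies $h(.)$ and $g(.)$, if

$$U\big(x, h(x)\big) + V_g\big(f(x, h(x))\big) \leq V_g(x), \quad \forall x \in \Omega, \tag{8}$$

then $V_h(x) \leq V_g(x), \forall x \in \Omega$.

Proof: Evaluating (8) at $x_0^h \in \Omega$, one has

$$U\big(x_0^h, h(x_0^h)\big) + V_g(x_1^h) \leq V_g(x_0^h), \quad \forall x_0^h \in \Omega. \tag{9}$$

Also, evaluating (8) at $x_1^h$ leads to

$$U\big(x_1^h, h(x_1^h)\big) + V_g(x_2^h) \leq V_g(x_1^h), \quad \forall x_1^h \in \Omega. \tag{10}$$

Using (10) in (9) leads to

$$U\big(x_0^h, h(x_0^h)\big) + U\big(x_1^h, h(x_1^h)\big) + V_g(x_2^h) \leq V_g(x_0^h), \quad \forall x_0^h \in \Omega. \tag{11}$$

Repeating this process for $N - 2$ more times leads to

$$\sum_{k=0}^{N-1} U\big(x_k^h, h(x_k^h)\big) + V_g(x_N^h) \leq V_g(x_0^h), \quad \forall x_0^h \in \Omega. \tag{12}$$

Letting $N \to \infty$ and given $V_g(x) \geq 0, \forall x$, which hence can be dropped from the left hand side, inequality (12) leads to
$V_h(x) \leq V_g(x), \forall x \in \Omega$, by definition of $V_h(.)$. □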


Assuming two admissible control policies $h(.)$ and $g(.)$, Lemma 1 simply shows that if applying $h(.)$ at the first time step
and applying $g(.)$ at all future time steps leads to a cost not greater than that of only applying $g(.)$, then the value
function of $h(.)$ also will not be greater than that of $g(.)$, at any point.
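
Lemma 1 can be checked numerically on a simple case. The sketch below assumes a hypothetical scalar system $x_{k+1} = a x_k + b u_k$ with utility $q x^2 + r u^2$ and linear policies $u = -Kx$ (all values are illustrative assumptions, not from this study). For such policies the value function is $P x^2$ with $P = (q + rK^2)/(1 - (a - bK)^2)$, so both sides of (8) reduce to coefficients of $x^2$ and the condition can be tested gain by gain.

```python
# Hypothetical scalar system and utility weights (assumptions for illustration).
a, b, q, r = 1.2, 1.0, 1.0, 1.0


def value_coeff(K):
    """P such that V(x) = P x^2 under u = -K x; infinite if the policy is not stabilizing."""
    c = a - b * K
    return (q + r * K**2) / (1.0 - c**2) if abs(c) < 1.0 else float("inf")


def one_step_h_then_g(K_h, P_g):
    """Coefficient of U(x, h(x)) + V_g(f(x, h(x))), i.e. the left hand side of (8)."""
    return q + r * K_h**2 + P_g * (a - b * K_h) ** 2


K_g = 1.0                                   # reference policy g
P_g = value_coeff(K_g)

# Candidate gains for h, and the subset that satisfies condition (8) against g.
candidates = [0.3 + 0.01 * j for j in range(180)]
passing = [K for K in candidates if one_step_h_then_g(K, P_g) <= P_g]

# Lemma 1 predicts V_h <= V_g for every passing candidate.
lemma_holds = all(value_coeff(K) <= P_g + 1e-12 for K in passing)
```

The small tolerance only absorbs floating-point rounding at gains where (8) holds with near equality, such as $K_h \approx K_g$.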

Lemma 2. Given admissible control policies $h(.)$ and $g(.)$, if

$$V_h(x) < V_g(x), \quad \exists x \in \Omega, \tag{13}$$

then

$$U\big(x, h(x)\big) + V_g\big(f(x, h(x))\big) < V_g(x), \quad \exists x \in \Omega. \tag{14}$$

Proof: The proof is done by contradiction. Assume that (14) does not hold, i.e.,

$$U\big(x_0^h, h(x_0^h)\big) + V_g(x_1^h) \geq V_g(x_0^h), \quad \forall x_0^h \in \Omega, \tag{15}$$

which leads to

$$U\big(x_1^h, h(x_1^h)\big) + V_g(x_2^h) \geq V_g(x_1^h), \quad \forall x_1^h \in \Omega. \tag{16}$$

Using (16) in (15) leads to

$$U\big(x_0^h, h(x_0^h)\big) + U\big(x_1^h, h(x_1^h)\big) + V_g(x_2^h) \geq V_g(x_0^h), \quad \forall x_0^h \in \Omega. \tag{17}$$

Repeating this process for $N - 2$ more times, one has

$$\sum_{k=0}^{N-1} U\big(x_k^h, h(x_k^h)\big) + V_g(x_N^h) \geq V_g(x_0^h), \quad \forall x_0^h \in \Omega. \tag{18}$$

Let $N \to \infty$. Given the admissibility of $g(.)$ and $h(.)$, one has $V_g(x_N^h) \to 0, \forall x_0^h$, as $N \to \infty$. The reason is $\lim_{N \to \infty} x_N^h = 0$
and the continuity of the upper bound of $V_g(.)$, per the admissibility of $g(.)$. Therefore, inequality (18) contradicts (13), because
the second term in the left hand side of (18) can be made arbitrarily small. Hence, (18) leads to $V_h(x) \geq V_g(x), \forall x \in \Omega$,$^1$
which contradicts (13), hence, (15) cannot hold. This completes the proof. □

In simple words, Lemma 2 shows that if the value function of $h(.)$ is less than that of $g(.)$ at least at one $x$, then the cost
of applying $h(.)$ only at the first step and applying $g(.)$ for the rest of the steps also will be less than the cost of only applying
$g(.)$ throughout the horizon, at least at one $x$. This result leads to the uniqueness of the solution to the Bellman equation (4),
as shown in the next theorem.

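Theorem 1. The Bellman equation given by (4) has a unique solution in $\Omega$.

Proof: The proof is by contradiction. Assume that there exists some $V_h(.)$ that satisfies

$$V_h(x) = \min_{u \in \mathcal{U}} \Big( U(x,u) + V_h\big(f(x,u)\big) \Big), \quad \forall x \in \Omega, \tag{19}$$

while $V^*(x) < V_h(x), \exists x \in \Omega$, in other words $h^*(x) \neq h(x), \exists x \in \Omega$, where

$$h(x) := \arg\min_{u \in \mathcal{U}} \Big( U(x,u) + V_h\big(f(x,u)\big) \Big), \quad \forall x \in \Omega. \tag{20}$$

Using Lemma 2, inequality $V^*(x) < V_h(x), \exists x \in \Omega$, leads to

$$U\big(x, h^*(x)\big) + V_h\big(f(x, h^*(x))\big) < V_h(x) = U\big(x, h(x)\big) + V_h\big(f(x, h(x))\big), \quad \exists x \in \Omega. \tag{21}$$

But (21) contradicts (20). Hence, $h^*(x) = h(x), \forall x \in \Omega$, and therefore, $V^*(x) = V_h(x), \forall x \in \Omega$. □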

The next step is the proof of convergence of PI. 

Theorem 2. The policy iteration given by equations (6) and (7) converges monotonically to the optimal solution in $\Omega$.

Proof: The first step is showing the monotonicity of the sequence of value functions $\{V^i(x)\}_{i=0}^{\infty}$ generated using the PI
equations. By (7), one has

$$U\big(x, h^{i+1}(x)\big) + V^i\big(f(x, h^{i+1}(x))\big) \leq V^i(x), \quad \forall x \in \Omega. \tag{22}$$

Using Lemma 1, the former inequality leads to

$$V^{i+1}(x) \leq V^i(x), \quad \forall x \in \Omega, \tag{23}$$

for any selected $i$. Hence, $\{V^i(x)\}_{i=0}^{\infty}$ is pointwise decreasing. On the other hand, it is lower bounded by the optimal value
function. Hence, it converges, [?]. Denoting the limit value function and the limit control policy with $V^\infty(.)$ and $h^\infty(.)$,
respectively, they satisfy PI equations

$$V^\infty(x) = U\big(x, h^\infty(x)\big) + V^\infty\big(f(x, h^\infty(x))\big), \quad \forall x \in \Omega, \tag{24}$$

and

$$h^\infty(x) = \arg\min_{u \in \mathcal{U}} \Big( U(x,u) + V^\infty\big(f(x,u)\big) \Big), \quad \forall x \in \Omega, \tag{25}$$

hence,

$$V^\infty(x) = \min_{u \in \mathcal{U}} \Big( U(x,u) + V^\infty\big(f(x,u)\big) \Big), \quad \forall x \in \Omega. \tag{26}$$

Eq. (26) is the Bellman equation, which per Theorem 1 has a unique solution. Hence, $V^\infty(.) = V^*(.)$ everywhere in $\Omega$. This
completes the proof. □

$^1$ This conclusion (in the proof of Lemma 2) can also be made using another contradiction argument, through (13), which leads to $V_h(x_0) + \epsilon = V_g(x_0), \exists x_0 \in \Omega$, for some
$\epsilon = \epsilon(x_0) > 0$. Then, selecting large enough $N$ such that $V_g(x_N^h) < \epsilon$, inequality (18) contradicts (13).


Theorem 3. The control policies at the iterations of the policy iteration given by equations (6) and (7) remain admissible in
$\Omega$.

Proof: Given the requirements for admissibility, one needs to show that each policy is asymptotically stabilizing and its
respective value function is upper bounded by a continuous function which passes through the origin. The latter follows from
the monotonicity of the sequence of value functions under PI, established in Theorem 2, since $h^0(.)$ is admissible. The former
also follows from this monotonicity, as no state trajectory can hide in the set at which the utility function is zero, without
convergence to the origin, per Assumption 2. In other words, in order for its value function to be bounded, the policy needs
to steer the trajectory towards the origin. □
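
The monotone convergence and the preserved admissibility established above can be observed on a case where both PI steps have closed forms. The sketch assumes a hypothetical scalar system $x_{k+1} = a x_k + b u_k$ with utility $q x^2 + r u^2$ (illustrative values, not from this study). With $V^i(x) = P_i x^2$ and $h^i(x) = -K_i x$, policy evaluation gives $P_i = (q + rK_i^2)/(1 - (a - bK_i)^2)$ and the policy update gives $K_{i+1} = abP_i/(r + b^2 P_i)$.

```python
import math

# Hypothetical scalar system and utility weights (assumptions for illustration).
a, b, q, r = 1.2, 1.0, 1.0, 1.0


def evaluate(K):
    """Policy evaluation for u = -K x: returns P with V(x) = P x^2."""
    c = a - b * K
    if abs(c) >= 1.0:
        raise ValueError("policy is not stabilizing")
    return (q + r * K**2) / (1.0 - c**2)


def improve(P):
    """Policy update: the minimizer of q x^2 + r u^2 + P (a x + b u)^2 is u = -K x."""
    return a * b * P / (r + b**2 * P)


def policy_iteration(K0, iterations=12):
    gains, values = [K0], []
    for _ in range(iterations):
        P = evaluate(gains[-1])
        values.append(P)
        gains.append(improve(P))
    return gains, values


def optimal_value():
    """Positive root of b^2 P^2 + (r - q b^2 - a^2 r) P - q r = 0 (scalar Riccati equation)."""
    beta = r - q * b**2 - a**2 * r
    return (-beta + math.sqrt(beta**2 + 4.0 * b**2 * q * r)) / (2.0 * b**2)


gains, values = policy_iteration(K0=1.5)    # K0 = 1.5 is stabilizing: |a - b*K0| = 0.3
```

Here `values` plays the role of the sequence $\{V^i\}$ and `gains` the sequence $\{h^i\}$; every gain produced by the update keeps $|a - bK_i| < 1$, so `evaluate` never raises for this starting point.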


V. Comparison between Policy and Value iterations 
Value iteration (VI), as an alternative to PI, is conducted using an initial guess $W^0(.)$ and iterating through the policy update
equation given by

$$g^i(x) = \arg\min_{u \in \mathcal{U}} \Big( U(x,u) + W^i\big(f(x,u)\big) \Big), \quad \forall x \in \Omega, \tag{27}$$

and the value update equation

$$W^{i+1}(x) = U\big(x, g^i(x)\big) + W^i\big(f(x, g^i(x))\big), \quad \forall x \in \Omega. \tag{28}$$

The two former equations can be merged into

$$W^{i+1}(x) = \min_{u \in \mathcal{U}} \Big( U(x,u) + W^i\big(f(x,u)\big) \Big), \quad \forall x \in \Omega, \tag{29}$$

for $i = 0, 1, \ldots$, where notations $W^i(.)$ and $g^i(.)$ are used for the value function and the control policy resulting from the VI,
respectively, for clarity. The convergence proof of VI is not the subject of this study and can be found in many references
including [?], [?], and [11].

The VI has the advantage of not requiring an admissible control as the initial guess. The PI, however, has the advantage that
the control policies subject to evolution remain stabilizing for the system. It was shown in [20] that if the VI is also initiated
using an admissible initial guess, the control policies remain stabilizing. Therefore, starting with an admissible guess, the VI
and PI seem to be similar in terms of stability. The computational load per iteration in VI is significantly less than that of
PI, since VI only requires a simple recursion using (28), called a ‘partial backup’ in [12], as compared with solving an
equation in PI, namely, Eq. (6), which is a ‘full backup’, [12]. However, in practice, it can be seen that the PI converges much
faster than the VI, in terms of the number of iterations. This section is aimed at providing some analytical results confirming
this observation.

Theorem 4. If $V^0(.) = W^0(.)$ is calculated using an admissible control policy, the policy iteration given by equations (6) and
(7) converges not slower than the value iteration given by equations (27) and (28), in $\Omega$.

Proof: Given the convergence of both schemes to the unique $V^*(.)$, the claim is proved by showing that $V^i(x) \leq W^i(x), \forall x \in \Omega, \forall i \in \mathbb{N}$.
From $V^0(.) = W^0(.)$ one has $h^1(.) = g^0(.)$. Hence,

$$W^1(x) = U\big(x, g^0(x)\big) + W^0\big(f(x, g^0(x))\big) = U\big(x, h^1(x)\big) + V^0\big(f(x, h^1(x))\big) \geq U\big(x, h^1(x)\big) + V^1\big(f(x, h^1(x))\big) = V^1(x), \quad \forall x \in \Omega, \tag{30}$$

where the inequality is due to the monotonically decreasing nature of $\{V^i(x)\}_{i=0}^{\infty}$ established in Theorem 2. Therefore,
$W^1(x) \geq V^1(x), \forall x$. Now, assume that $W^i(x) \geq V^i(x), \forall x$, for some $i$. Then,

$$W^{i+1}(x) = U\big(x, g^i(x)\big) + W^i\big(f(x, g^i(x))\big) \geq U\big(x, g^i(x)\big) + V^i\big(f(x, g^i(x))\big) \geq U\big(x, h^{i+1}(x)\big) + V^i\big(f(x, h^{i+1}(x))\big) \geq U\big(x, h^{i+1}(x)\big) + V^{i+1}\big(f(x, h^{i+1}(x))\big) = V^{i+1}(x), \quad \forall x \in \Omega. \tag{31}$$

The first inequality is due to the assumed $W^i(x) \geq V^i(x), \forall x$. The second inequality is due to the fact that $h^{i+1}(.)$ is
the minimizer of the term subject to comparison, and the last inequality is due to the monotonicity of $\{V^i(x)\}_{i=0}^{\infty}$. Hence,
$W^{i+1}(x) \geq V^{i+1}(x), \forall x \in \Omega$, and the claim is proved by induction. □

It should be noted that the result given in the former theorem is probably very conservative, as it only shows that the 
convergence of the PI will not be ‘slower’ than that of the VI. 
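
The ordering $V^i \leq W^i$ can be observed numerically. The sketch below again assumes a hypothetical scalar system $x_{k+1} = a x_k + b u_k$ with utility $q x^2 + r u^2$ (illustrative values, not from this study), for which both value sequences are quadratic and can be tracked through their coefficients. Both schemes start from the value function of the same stabilizing gain, as in the theorem.

```python
# Hypothetical scalar system and utility weights (assumptions for illustration).
a, b, q, r = 1.2, 1.0, 1.0, 1.0


def evaluate(K):
    """Value coefficient of the stabilizing policy u = -K x (full backup, Eq. (6))."""
    c = a - b * K
    if abs(c) >= 1.0:
        raise ValueError("policy is not stabilizing")
    return (q + r * K**2) / (1.0 - c**2)


def greedy_gain(P):
    """Gain of the minimizing control for the value function P x^2 (Eqs. (7) and (27))."""
    return a * b * P / (r + b**2 * P)


def pi_values(P0, iterations):
    """Coefficients of V^0, V^1, ... generated by policy iteration."""
    values = [P0]
    for _ in range(iterations):
        values.append(evaluate(greedy_gain(values[-1])))
    return values


def vi_values(P0, iterations):
    """Coefficients of W^0, W^1, ... generated by value iteration."""
    values = [P0]
    for _ in range(iterations):
        W = values[-1]
        K = greedy_gain(W)
        values.append(q + r * K**2 + W * (a - b * K) ** 2)   # partial backup, Eq. (28)
    return values


P0 = evaluate(1.5)            # V^0 = W^0 from the admissible gain K = 1.5
V = pi_values(P0, 15)
W = vi_values(P0, 15)
```

In this example the gap between the two sequences is strict in the early iterations, which is consistent with the remark that the theorem's "not slower" statement is conservative.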


VI. Convergence Analysis of Multi-step Look-ahead Policy Iteration

Multi-step Look-ahead Policy Iteration (MLPI), [?], is a variation of PI, given by the policy evaluation equation (6) repeated
below

$$V^i(x_0) = U\big(x_0, h^i(x_0)\big) + V^i\big(f(x_0, h^i(x_0))\big), \quad \forall x_0 \in \Omega,$$

and the new policy update equation with $n$-step look-ahead ($n \in \mathbb{N}$, $n > 0$) given by

$$h^{i+1}(x_0) = \arg\min_{h \in \mathcal{H}} \Big( \sum_{k=0}^{n-1} U\big(x_k^h, h(x_k^h)\big) + V^i(x_n^h) \Big), \quad \forall x_0^h = x_0 \in \Omega. \tag{32}$$

It can be seen that the regular PI is a special case of the MLPI with $n = 1$, [?]. It is not surprising to expect the MLPI to
converge faster than the regular PI, as in the extreme case that $n \to \infty$, the optimal solution will be calculated in one iteration,
using (32), i.e., the iterations converge to the optimal solution after the very first iteration. The rest of this section provides
the convergence analysis for MLPI, for $1 < n < \infty$.
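
The special case $n = 1$ can be written out explicitly. The short derivation below is a sketch in the notation of (7) and (32), under the reading that the minimization in (32) is over policies $h$ and that the updated policy at $x_0$ is the minimizing policy's control at $x_0$; it substitutes $n = 1$ into the MLPI policy update and recovers the regular PI update.

```latex
\begin{align*}
h^{i+1}(x_0)
  &= \arg\min_{h} \Big( U\big(x_0^h, h(x_0^h)\big) + V^i\big(x_1^h\big) \Big)
     && \text{(32) with } n = 1 \\
  &= \arg\min_{h} \Big( U\big(x_0, h(x_0)\big) + V^i\big(f(x_0, h(x_0))\big) \Big)
     && x_0^h = x_0,\; x_1^h = f\big(x_0, h(x_0)\big) \\
  &= \arg\min_{u} \Big( U(x_0, u) + V^i\big(f(x_0, u)\big) \Big)
     && \text{only } u = h(x_0) \text{ enters the cost}
\end{align*}
```

The last line is the policy update (7) evaluated at $x = x_0$. For $n > 1$ the controls applied at $x_1^h, \ldots, x_{n-1}^h$ also enter the cost, which is what distinguishes the multi-step look-ahead from the regular update.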

Theorem 5. The multi-step look-ahead policy iteration given by equations (6) and (32) converges monotonically to the optimal
solution in $\Omega$.

Proof: The proof is similar to the proof of Theorem 2. Initially it is shown that the sequence of value functions $\{V^i(x)\}_{i=0}^{\infty}$
generated using the MLPI is monotonically decreasing. By (32), one has

$$\sum_{k=0}^{n-1} U\big(x_k^{h^{i+1}}, h^{i+1}(x_k^{h^{i+1}})\big) + V^i\big(x_n^{h^{i+1}}\big) \leq V^i(x_0), \quad \forall x_0^{h^{i+1}} = x_0 \in \Omega, \tag{33}$$

which is the consequence of $h^{i+1}(.)$ being the minimizer of the left hand side of the former inequality. Using the line of proof
in the proof of Lemma 2, inequality (33) may be repeated in itself an infinite number of times to get

$$V^{i+1}(x) \leq V^i(x), \quad \forall x \in \Omega, \tag{34}$$

which is valid for any selected $i$. Hence, $\{V^i(x)\}_{i=0}^{\infty}$ under the MLPI is pointwise decreasing. It is also lower bounded by the
optimal value function, therefore, converges, [?]. Denoting the limit value function and the limit control policy with $V^\infty(.)$
and $h^\infty(.)$, respectively, they satisfy the MLPI equations

$$V^\infty(x_0) = U\big(x_0, h^\infty(x_0)\big) + V^\infty\big(f(x_0, h^\infty(x_0))\big), \quad \forall x_0 \in \Omega, \tag{35}$$

and

$$h^\infty(x_0) = \arg\min_{h \in \mathcal{H}} \Big( \sum_{k=0}^{n-1} U\big(x_k^h, h(x_k^h)\big) + V^\infty(x_n^h) \Big), \quad \forall x_0^h = x_0 \in \Omega, \tag{36}$$

hence,

$$V^\infty(x_0) = \min_{h \in \mathcal{H}} \Big( \sum_{k=0}^{n-1} U\big(x_k^h, h(x_k^h)\big) + V^\infty(x_n^h) \Big), \quad \forall x_0^h = x_0 \in \Omega. \tag{37}$$

Eq. (37) is the $n$-step look-ahead version of the Bellman equation (4) and $V^*(.)$ satisfies it, by definition. It can be proved
that this equation also has a unique solution, which is $V^*(.)$, using the line of proof in Lemma 2 and Theorem 1. To this end,
assume that

$$V^*(x) < V^\infty(x), \quad \exists x \in \Omega, \tag{38}$$

hence, $h^*(x) \neq h^\infty(x), \exists x \in \Omega$. Inequality (38) leads to

$$\sum_{k=0}^{n-1} U\big(x_k^{h^*}, h^*(x_k^{h^*})\big) + V^\infty\big(x_n^{h^*}\big) < V^\infty\big(x_0^{h^*}\big), \quad \exists x_0^{h^*} \in \Omega. \tag{39}$$

Otherwise, one has

$$\sum_{k=0}^{n-1} U\big(x_k^{h^*}, h^*(x_k^{h^*})\big) + V^\infty\big(x_n^{h^*}\big) \geq V^\infty\big(x_0^{h^*}\big), \quad \forall x_0^{h^*} \in \Omega, \tag{40}$$

which, by repeating it in itself an unlimited number of times and considering the fact that the $V^\infty(.)$ term in the left hand side can be
made arbitrarily small after a large enough number of repetitions, leads to

$$V^*(x) \geq V^\infty(x), \quad \forall x \in \Omega. \tag{41}$$

But (41) contradicts (38), hence, (38) leads to (39). But (39) contradicts $h^*(x) \neq h^\infty(x), \exists x \in \Omega$, per (36), given $h^*(.) \in \mathcal{H}$.
Therefore, (38) cannot hold and $V^*(x) = V^\infty(x), \forall x \in \Omega$, which completes the proof. □

VII. Conclusions 

The convergence of the policy iteration scheme to the solution of optimal control problems was analyzed. The speed of 
convergence of the policy iteration was shown to be not slower than that of the value iteration. Finally, the convergence of the 
multi-step look-ahead policy iteration to the optimal solution was established.